2.4.4 Distribution Rectification Distillation
Inner-level optimization. We first detail the maximization of the self-information entropy. According to the definition of self-information entropy, $H(q^S)$ can be implicitly expanded as:
$$
H(q^S) = -\sum_{q^S_i \in q^S} p(q^S_i)\,\log p(q^S_i). \tag{2.30}
$$
However, an explicit form of $H(q^S)$ can only be parameterized with a regular distribution $p(q^S_i)$. Luckily, the statistical results in Fig. 2.8 show that the query distribution tends to follow a Gaussian distribution, as also observed in [136]. This enables us to solve the inner-level optimization in a distribution-alignment fashion. To this end, we first calculate the mean $\mu(q^S)$ and standard deviation $\sigma(q^S)$ of the query $q^S$, whose distribution is then modeled as $q^S \sim \mathcal{N}(\mu(q^S), \sigma(q^S)^2)$. Then, the self-information entropy of the student query can be written as:
$$
\begin{aligned}
H(q^S) &= -\mathbb{E}\big[\log \mathcal{N}(\mu(q^S), \sigma(q^S)^2)\big] \\
&= -\mathbb{E}\left[\log\left[\big(2\pi\sigma(q^S)^2\big)^{-\frac{1}{2}} \exp\left(-\frac{(q^S_i - \mu(q^S))^2}{2\sigma(q^S)^2}\right)\right]\right] \\
&= \frac{1}{2}\log 2\pi\sigma(q^S)^2 + \frac{\mathbb{E}\big[(q^S_i - \mu(q^S))^2\big]}{2\sigma(q^S)^2}
= \frac{1}{2}\log 2\pi e\,\sigma(q^S)^2.
\end{aligned} \tag{2.31}
$$
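Eqs. (2.30) and (2.31) can be verified numerically: a histogram-based estimate of $-\sum p \log p$ over Gaussian samples approaches the closed-form Gaussian entropy $\frac{1}{2}\log 2\pi e\,\sigma^2$. The snippet below is only an illustrative sanity check (the chosen $\sigma$, sample size, and bin count are arbitrary), not part of the Q-DETR method itself.

```python
import numpy as np
from scipy.stats import norm

sigma = 0.7  # example standard deviation sigma(q^S); any positive value works

# Closed form from Eq. (2.31): differential entropy of a Gaussian, in nats.
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
# SciPy's reference value agrees up to floating-point error.
assert np.isclose(closed_form, norm(loc=0.0, scale=sigma).entropy())

# Histogram estimate of Eq. (2.30) from samples; it approximately recovers
# the closed form as the sample size grows and the bins shrink.
samples = norm(loc=0.0, scale=sigma).rvs(size=200_000, random_state=0)
p, edges = np.histogram(samples, bins=200, density=True)
width = edges[1] - edges[0]
mc_entropy = -np.sum(p * np.log(p + 1e-12) * width)
print(closed_form, mc_entropy)
```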
The objective in Eq. (2.31) reaches its maximum of $H(q^{S*}) = \frac{1}{2}\log 2\pi e\,[\sigma(q^S)^2 + \epsilon_{q^S}]$ when $q^{S*} = [q^S - \mu(q^S)]/\sqrt{\sigma(q^S)^2 + \epsilon_{q^S}}$, where $\epsilon_{q^S} = 10^{-5}$ is a small constant added to prevent a zero denominator. In practice, the mean and variance might be inaccurate due to query data bias. To address this, we borrow concepts from batch normalization (BN) [207, 102]: a learnable shifting parameter $\beta_{q^S}$ is added to move the mean value, and a learnable scaling parameter $\gamma_{q^S}$ is multiplied to rescale the query to an adaptive range. In this situation, we rectify the information entropy of the query in the student as follows:
$$
q^{S*} = \frac{q^S - \mu(q^S)}{\sqrt{\sigma(q^S)^2 + \epsilon_{q^S}}}\,\gamma_{q^S} + \beta_{q^S}, \tag{2.32}
$$
in which case the maximum self-information entropy of the student query becomes $H(q^{S*}) = \frac{1}{2}\log 2\pi e\,[(\sigma(q^S)^2 + \epsilon_{q^S})/\gamma_{q^S}^2]$. Therefore, in the forward propagation, we obtain the current optimal query $q^{S*}$ via Eq. (2.32), after which the upper-level optimization is executed as detailed below.
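In implementation terms, Eq. (2.32) is a batch-normalization-style rescaling of the student queries with the learnable $\gamma_{q^S}$ and $\beta_{q^S}$. The following is a minimal PyTorch-style sketch of this rectification step; the module name, tensor shapes, and the choice to normalize over the embedding dimension are assumptions made for illustration rather than the authors' released code.

```python
import torch
import torch.nn as nn

class QueryRectifier(nn.Module):
    """Distribution rectification of the student query, following Eq. (2.32):
    normalize q^S to maximize its self-information entropy, then rescale/shift
    with the learnable gamma_{q^S} and beta_{q^S}."""

    def __init__(self, embed_dim: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(embed_dim))   # learnable scaling gamma_{q^S}
        self.beta = nn.Parameter(torch.zeros(embed_dim))   # learnable shifting beta_{q^S}
        self.eps = eps                                      # epsilon_{q^S} = 1e-5

    def forward(self, q_s: torch.Tensor) -> torch.Tensor:
        # q_s: student queries, e.g. of shape (num_queries, batch, embed_dim).
        mu = q_s.mean(dim=-1, keepdim=True)                  # mu(q^S)
        var = q_s.var(dim=-1, unbiased=False, keepdim=True)  # sigma(q^S)^2
        q_hat = (q_s - mu) / torch.sqrt(var + self.eps)      # normalized query
        return q_hat * self.gamma + self.beta                # Eq. (2.32)

# Usage sketch: rectify the student decoder queries before distillation.
# q_s_star = QueryRectifier(embed_dim=256)(student_queries)
```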
Upper-level optimization. We then minimize the conditional information entropy between the student and the teacher. Following DETR [31], we denote the ground-truth labels by $y^{GT} = \{c^{GT}_i, b^{GT}_i\}_{i=1}^{N_{gt}}$, a set of ground-truth objects in which $N_{gt}$ is the number of foreground objects, and $c^{GT}_i$ and $b^{GT}_i$ respectively represent the class and the coordinates (bounding box) of the $i$-th object. In DETR, each query is associated with an object. Therefore, we likewise obtain $N$ objects for the teacher and the student, denoted as $y^S = \{c^S_j, b^S_j\}_{j=1}^{N}$ and $y^T = \{c^T_j, b^T_j\}_{j=1}^{N}$.
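For concreteness, the $(c_j, b_j)$ pairs above can be read off a DETR-style output dictionary. The helper below is a hedged sketch: it assumes the common "pred_logits"/"pred_boxes" output keys used by DETR implementations, which may differ in a given codebase.

```python
import torch

def to_object_set(outputs: dict):
    """Collect the N predicted objects y = {(c_j, b_j)}_{j=1}^{N} from DETR-style outputs.

    Assumes outputs["pred_logits"] has shape (N, num_classes + 1) and
    outputs["pred_boxes"] has shape (N, 4); adapt keys/shapes to the model at hand.
    """
    classes = outputs["pred_logits"].softmax(dim=-1).argmax(dim=-1)  # c_j per query
    boxes = outputs["pred_boxes"]                                    # b_j per query
    return classes, boxes

# Toy check with random tensors standing in for a model's outputs.
dummy = {"pred_logits": torch.randn(100, 92), "pred_boxes": torch.rand(100, 4)}
y_classes, y_boxes = to_object_set(dummy)
```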
The minimization of the conditional information entropy requires a one-to-one matching between the student and teacher objects. However, this is problematic for DETR, due primarily to the sparsity of the prediction results and the instability of the query predictions [129]. To solve this, we propose a foreground-aware query matching to rectify the "well-matched" queries. Concretely, we match the ground-truth bounding boxes against the student's predicted boxes to find the maximum coincidence:
$$
G_i = \max_{1 \le j \le N} \mathrm{GIoU}\big(b^{GT}_i, b^S_j\big), \tag{2.33}
$$